Training high-quality recommendation models requires collecting sensitive user data. The popular privacy-enhancing training method, federated learning (FL), cannot be used practically due to these models’ large embedding tables. This paper introduces FEDORA, a system for training recommendation models with FL. FEDORA allows each user to download, train, and upload only a small subset of the large tables based on their private data, while hiding the access pattern using oblivious memory (ORAM). FEDORA reduces the ORAM’s prohibitive latency and memory overheads by (1) introducing ε-FDP, a formal way to balance the ORAM’s privacy with performance, and (2) placing the large ORAM in a power- and cost-efficient SSD with SSD-friendly optimizations. Additionally, FEDORA is carefully designed to support (3) modern operation modes of FL. FEDORA achieves high model accuracy by using private features during training while achieving, on average, 5× latency and 158× SSD lifetime improvement over the baseline.
Free, publicly-accessible full text available March 30, 2026
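To make the access-pattern concern concrete, the minimal C++ sketch below (all names are illustrative, not code from the paper) shows a much simpler stand-in for the idea: a client pads the set of embedding rows it actually needs with random dummy indices so the server cannot tell which rows correspond to the user's private data. FEDORA's actual design hides access patterns with an SSD-resident ORAM tuned via ε-FDP, which this toy padding scheme does not approximate in strength.

```cpp
// Illustrative sketch only: a federated client requests just the embedding
// rows its private data touches, hidden among random dummy rows so every
// request has the same shape. Assumes padded_size <= table_rows.
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Build a fixed-size request containing every genuinely needed row plus
// random dummy rows; the server sees the same request shape regardless of
// which rows the user's data actually requires.
std::vector<std::size_t> build_padded_request(
        const std::vector<std::size_t>& needed_rows,
        std::size_t table_rows,
        std::size_t padded_size,
        std::mt19937_64& rng) {
    std::unordered_set<std::size_t> request(needed_rows.begin(), needed_rows.end());
    std::uniform_int_distribution<std::size_t> pick(0, table_rows - 1);
    while (request.size() < padded_size) {
        request.insert(pick(rng));  // dummy row, indistinguishable from a real one
    }
    return {request.begin(), request.end()};
}
```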
Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically the embedding stage of DLRM inference, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation, which lead to significant stalls in the CPU pipeline. As model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches, and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) software prefetching, to hide the memory access latency suffered by the demand loads, and (2) overlapping computation and memory accesses via hyperthreading, to reduce CPU stalls and minimize the overall execution time. We evaluate our work on single-core and 24-core configurations with the latest recommendation models and recently released production traces. Our integrated techniques speed up inference by up to 1.59×, and by 1.4× on average.
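As a concrete illustration of the first technique, the C++ sketch below (function and parameter names are assumptions, not the paper's implementation) prefetches embedding rows a few lookups ahead of the demand loads during the gather stage, which is the standard way software prefetching hides the DRAM latency of irregular accesses.

```cpp
// Minimal sketch of software prefetching for an embedding-gather loop,
// assuming a flat row-major float table and a batch of row indices known
// up front. The lookahead distance is illustrative and would need tuning.
#include <immintrin.h>
#include <cstddef>
#include <vector>

// Copy dim-wide embedding rows for the given indices into `out`, issuing a
// prefetch several iterations ahead so later demand loads find the row's
// first cache line already resident instead of stalling on DRAM.
void gather_rows_with_prefetch(const float* table, std::size_t dim,
                               const std::vector<std::size_t>& indices,
                               float* out, std::size_t lookahead = 8) {
    for (std::size_t i = 0; i < indices.size(); ++i) {
        if (i + lookahead < indices.size()) {
            const float* future_row = table + indices[i + lookahead] * dim;
            _mm_prefetch(reinterpret_cast<const char*>(future_row), _MM_HINT_T0);
        }
        const float* row = table + indices[i] * dim;
        for (std::size_t d = 0; d < dim; ++d) {
            out[i * dim + d] = row[d];  // demand load of the current row
        }
    }
}
```

The lookahead distance is a trade-off: too small and the prefetch has not completed before the demand load arrives, too large and the prefetched line may be evicted from cache before it is used.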
